Exploratory Data Analysis

Exploring Coal Consumption Data

Exploring the elements of the average temperature time series is required to move on. So that further models may take these components into account, it is crucial to understand how each time series component affects the entire time series. Time series data primarily consist of four elements: level, trend, seasonality, and noise. Noise and unpredictability are nonsystematic components, whereas level, trend, and seasonality are systematic. A greater comprehension of a time series may be attained by breaking it down into these parts.

In this exploratory data analysis, the plots will be looking at a time series for the quarterly US total coal consumption only. Even though the data collected has information on all of the 50 states (and that information will be used in future analysis), it makes sense to scope the analysis to only US for simplification purposes. It also makes sense since all the states follow the same trend of coal consumption.


Time Series Plot

Firstly, it is important to visualize the time series at its most basic level, which is a simple plot of data over time.

Code
coal_df <- read.csv("/Users/raezh1/Documents/Georgetown/ANLY560/website/Time Series/data/coal_us_consumption.csv") %>%
  group_by(period) %>%
  summarise(Consumption = sum(consumption)) %>%
  filter(!period %in% c('2000-Q1', '2000-Q2',  '2000-Q3', '2000-Q4'))

coal_df_ts <- ts(coal_df$Consumption, start = c(2001,1), end = c(2021,1), frequency = 4)
Code
theme_set(theme_gray(base_size=12,base_family="Palatino"))
autoplotly(coal_df_ts) +
  ggtitle("US Quarterly Coal Consumption") +
  xlab("Year (2001-2021)") +
  ylab("Short Ton")


Decomposition

decompose() Function

Multiplicative trend means the trend is not linear (curved line), and multiplicative seasonality means there are changes to widths or heights of seasonal periods over time. From the time series graph created above, we should use multiplicative model from decompose() to decompose the coal consumption.

Code
decompose_coal = decompose(coal_df_ts, "multiplicative")
autoplotly(decompose_coal) +
  ggtitle("Decomposition of US Total Coal Consumption")


stl() Function

In R the stl() function performs decomposition of a time series into seasonal, trend and irregular components using loses. stl() will handle any type of seasonality, not only monthly and quarterly data.

Code
coal_df_ts %>%
  stl(s.window="periodic", robust=TRUE) %>%
  autoplotly()

The two plots of decomposition are very similar except the remainder from stl() function is less smoothed and more noisy than the remainder from decompose() function. They both show us the US quarterly coal consumption data has seasonality. Before 2010, the trend of the data is increasing. After 2010, the trend of the data is decreasing.

Since the data has a trend, it doesn’t has stationarity as a stationary time series is one whose properties do not depend on the time at which the series is observed. The ACF graph below will show more of its non-stationary feature.


Lag Plots

A lag plot is a type of scatter plot where time series are plotted in pairs against itself some time units behind or ahead. Lag plots often reveal more information on the seasonality of the data, whether there is randomness in the data or an indication of autocorrelation in the data. Below is a lag plot for the quarterly US coal consumption data.

Code
library(wesanderson)
gglagplot(coal_df_ts, do.lines=FALSE, set.lags = c(4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64)) + 
  xlab("Lags") + 
  ggtitle("Lag Plot of US Total Coal Consumption") +
  theme_minimal() +
  ylab("Coal Consumption in short tons") +
  theme_light() +
  theme(text=element_text(size=12, family="Palatino")) +
  labs(fill = "Legend") +
  scale_color_brewer(palette="Set2")

From lag plots here we can see that the coal consumption data has serial correlation as the lag plots are showing a linear pattern, which suggests autocorrelation is present. This is a positive linear trend, so the data has positive autocorrelation. However, it’s hard to spot the seasonality from the lag plots, which makes me believe the data doesn’t have seasonality.


Autocorrelation and Stationarity

The function ACF computes (and by default plots) an estimate of the autocorrelation function of a univariate time series. Function PACF computes (and by default plots) an estimate of the partial autocorrelation function of a univariate time series.

Autocorrelation and partial autocorrelation plots are heavily used in time series analysis and forecasting.

ACF Plot

Code
acf <- ggAcf(coal_df_ts, 80) + ggtitle("ACF Plot of US Total Coal Consumption")
ggplotly(acf)

Note that the ACF shows an oscillation, indicative of a seasonal series. Note the peaks occur at lags of 4th quarter, 8th quarter, and 12th quarter, etc. It means the data has yearly seasonality.

A stationary time series is one whose properties do not depend on the time at which the series is observed. Thus, time series with trends, or with seasonality, are not stationary — the trend and seasonality will affect the value of the time series at different times. On the other hand, a white noise series is stationary — it does not matter when you observe it, it should look much the same at any point in time.

As well as looking at the time plot of the data, the ACF plot is also useful for identifying non-stationary time series. For a stationary time series, the ACF will drop to zero relatively quickly, while the ACF of non-stationary data decreases slowly. Our ACF graph decreases slowly and we can clearly see the data is not stationary.

PACF Plot

Code
pacf <- ggPacf(coal_df_ts, 50) + ggtitle("PACF Plot of US Total Coal Consumption")
ggplotly(pacf)

From the PACF plot, we can see a large spike at lag 1 that decreases after a few lags, which indicates a moving average term in the data.

Stationarity Test

Augmented Dickey-Fuller Test

The Augmented Dickey-Fuller test is a type of statistical test called a unit root test.

ADF test is conducted with the following assumptions.

Null Hypothesis (HO): Series is non-stationary or series has a unit root.

Alternate Hypothesis(HA): Series is stationary or series has no unit root.

If the null hypothesis is failed to be rejected, this test may provide evidence that the series is non-stationary.

Conditions to Reject Null Hypothesis(HO)

Code
adf.test(coal_df_ts)

    Augmented Dickey-Fuller Test

data:  coal_df_ts
Dickey-Fuller = -3.2102, Lag order = 4, p-value = 0.09218
alternative hypothesis: stationary

The p value of ADF test is greater than 0.05, so we cannot reject null hypothesis and it means the US total coal consumption data is not stationary. The rest result matches with the conclusion we had after observing decomposition plot and ACF plot.


Differencing and Detrending

Number of Differences and Seasonal Differences Estimation

KPSS Test

In this test, the null hypothesis is that the data are stationary, and we look for evidence that the null hypothesis is false. Consequently, small p-values (e.g., less than 0.05) suggest that differencing is required.

Code
library(urca)
coal_df_ts %>% ur.kpss() %>% summary()

####################### 
# KPSS Unit Root Test # 
####################### 

Test is of type: mu with 3 lags. 

Value of test-statistic is: 1.7562 

Critical value for a significance level of: 
                10pct  5pct 2.5pct  1pct
critical values 0.347 0.463  0.574 0.739

The test statistic is much bigger than the 1% critical value, indicating that the null hypothesis is rejected. That is, the data are not stationary. We can difference the data, and apply the test again.

Code
coal_df_ts %>% log() %>% diff(lag=4) %>% diff() %>% ur.kpss() %>% summary()

####################### 
# KPSS Unit Root Test # 
####################### 

Test is of type: mu with 3 lags. 

Value of test-statistic is: 0.0871 

Critical value for a significance level of: 
                10pct  5pct 2.5pct  1pct
critical values 0.347 0.463  0.574 0.739

This time, the test statistic is tiny, and well within the range we would expect for stationary data. So we can conclude that the differenced data are stationary. Further more, after trying several differences and seasonal differences combinations, taking one seasonal difference and one difference has the lowest result 0.087.


A similar function for determining whether seasonal differencing is required is nsdiffs(), which uses the measure of seasonal strength introduced to determine the appropriate number of seasonal differences required. No seasonal differences are suggested if the result is less than 0.64, otherwise one seasonal difference is suggested.

We can apply nsdiffs() to the logged US Total Coal Consumption data.

Code
coal_df_ts %>% log() %>% diff() %>% nsdiffs()
[1] 1
Code
coal_df_ts %>% log() %>% diff(lag=4) %>% ndiffs()
[1] 1

Because nsdiffs() returns 1 (indicating one seasonal difference is required), we apply the ndiffs() function to the seasonally differenced data. These functions suggest we should do both a seasonal difference and a first difference.

Transformations such as logarithms can help to stabilise the variance of a time series. Differencing can help stabilise the mean of a time series by removing changes in the level of a time series, and therefore eliminating (or reducing) trend and seasonality.

Code
cbind("Consumption" = coal_df_ts,
      "Quarterly log" = log(coal_df_ts),
      "Annual change in log" = diff(log(coal_df_ts),lag=4),
      "Doubly\n differenced logs" = diff(diff(log(coal_df_ts),lag=4))) %>%
  autoplot(facets=TRUE) +
    xlab("Year") + ylab("") +
    ggtitle("Differencing of US Total Coal Consumption")

New Stationary Data

Finally, the seasonality and correlation should be removed to make the time series stationary. A comparison of all of the methods is seen below. Once the transformations are applied, the series is stationary. This can also be seen in the plot below.

Code
`Log Transformation` <- log(coal_df_ts)
`Remove Seasonality` <- diff(log(coal_df_ts), lag=4)
`First Diff After Remove Seasonality` <- diff(diff(log(coal_df_ts), lag=4))

a <- ggAcf(coal_df_ts,70) + ggtitle("Original Data")
b <- ggAcf(`Log Transformation`,70) + ggtitle("Log Transformation")
c <- ggAcf(`Remove Seasonality`,70) + ggtitle("Remove Seasonality")
d <- ggAcf(`First Diff After Remove Seasonality`,70) + ggtitle("First Diff After Remove Seasonality")

ACF Plots

Original and Log Transformation

Code
ggarrange(a, b, ncol = 1, nrow = 2)

Detrend and Remove Seasonality

Code
ggarrange(c, d, ncol = 1, nrow = 2)

The first three graphs above are not as ideal as the last ACF graph which is First Diff After Remove Seasonality.

For white noise series and stationary series, we expect each autocorrelation to be close to zero. Of course, they will not be exactly equal to zero as there is some random variation. For a white noise series, we expect 95% of the spikes in the ACF to lie within ±2/√T where T is the length of the time series. It is common to plot these bounds on a graph of the ACF (the blue dashed lines above). If one or more large spikes are outside these bounds, or if substantially more than 5% of spikes are outside these bounds, then the series is probably not white noise, and it’s not stationary either.

This ACF graph of detrended and differenced US total coal consumption data shows almost all the spikes are inside the blue bounds, which proves the detrended and differenced data is stationary and ready to be utilized in future analysis.